Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use specialized dictionary kernels (#1178) #2808

Merged
merged 2 commits into from
Jun 30, 2022

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jun 28, 2022

Which issue does this PR close?

Part of #1178

Rationale for this change

Arrow-rs supports native evaluation on dictionaries for comparison operations against other dictionaries and scalars. We should make use of this to avoid hydrating dictionaries unnecessarily

What changes are included in this PR?

Tweaks the coercion rules to coerce to the dictionary type if supported by the operator

Are there any user-facing changes?

No

match (&left_value, &left_data_type, &right_value, &right_data_type) {
// Types are equal => valid
(_, l, _, r) if l == r => {}
// Allow comparing a dictionary value with its corresponding scalar value
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually necessary for correctness in addition to being beneficial for performance, because ScalarValue does not have a way to encode a dictionary data type

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions labels Jun 28, 2022
@tustvold
Copy link
Contributor Author

Unsurprisingly the performance benefits of this are quite pronounced

scheduled: select count(*) from t where dict_10_required = 'prefix#0'                                                                             
                        time:   [4.0683 ms 4.0732 ms 4.0783 ms]
                        change: [-40.646% -40.476% -40.299%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

tokio: select count(*) from t where dict_100_required = 'prefix#0'                                                                             
                        time:   [4.4917 ms 4.5479 ms 4.6022 ms]
                        change: [-39.027% -38.266% -37.502%] (p = 0.00 < 0.05)
                        Performance has improved.

scheduled: select count(*) from t where dict_100_required = 'prefix#0'                                                                             
                        time:   [3.8694 ms 3.8755 ms 3.8815 ms]
                        change: [-33.176% -32.985% -32.795%] (p = 0.00 < 0.05)
                        Performance has improved.

tokio: select count(*) from t where dict_1000_required = 'prefix#0'                                                                             
                        time:   [4.7944 ms 4.8326 ms 4.8687 ms]
                        change: [-31.344% -30.719% -30.083%] (p = 0.00 < 0.05)
                        Performance has improved.

@codecov-commenter
Copy link

Codecov Report

Merging #2808 (1513b88) into master (7617d78) will increase coverage by 0.11%.
The diff coverage is 96.15%.

@@            Coverage Diff             @@
##           master    #2808      +/-   ##
==========================================
+ Coverage   85.11%   85.22%   +0.11%     
==========================================
  Files         273      274       +1     
  Lines       48242    48634     +392     
==========================================
+ Hits        41059    41449     +390     
- Misses       7183     7185       +2     
Impacted Files Coverage Δ
datafusion/physical-expr/src/expressions/binary.rs 95.12% <75.00%> (+<0.01%) ⬆️
datafusion/expr/src/binary_rule.rs 84.76% <100.00%> (+0.47%) ⬆️
datafusion/core/src/physical_plan/join_utils.rs 93.61% <0.00%> (-3.20%) ⬇️
datafusion/sql/src/planner.rs 81.36% <0.00%> (ø)
...fusion/optimizer/src/single_distinct_to_groupby.rs 98.80% <0.00%> (ø)
...on/core/src/physical_optimizer/coalesce_batches.rs 100.00% <0.00%> (ø)
datafusion/optimizer/src/reduce_outer_join.rs 99.39% <0.00%> (ø)
datafusion/core/tests/sql/joins.rs 99.31% <0.00%> (+0.20%) ⬆️
datafusion/expr/src/logical_plan/plan.rs 74.40% <0.00%> (+0.29%) ⬆️
datafusion/core/src/config.rs 90.76% <0.00%> (+0.44%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7617d78...1513b88. Read the comment docs.

@tustvold tustvold marked this pull request as ready for review June 29, 2022 19:31
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold

@@ -155,14 +155,12 @@ pub fn comparison_eq_coercion(
lhs_type: &DataType,
rhs_type: &DataType,
) -> Option<DataType> {
// can't compare dictionaries directly due to
// https://github.com/apache/arrow-rs/issues/1201
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants